Gene coexpression analysis in Arabidopsis thaliana based on public microarray data

Summary Coexpressed genes tend to participate in related biological processes. Gene coexpression analysis allows the discovery of functional gene partners or the assignment of biological roles to genes of unknown function. In this protocol, we describe the steps necessary to create a gene coexpression tree for Arabidopsis thaliana, using publicly available Affymetrix CEL microarray data. Because the computational analysis described here is highly dependent on sample quality, we detail an automatic quality control approach. For complete details on the use and execution of this protocol, please refer to Zogopoulos et al. (2021).

Note: If the user password is changed, then the password field inside /home/ACT/Parsers/ config.ini must also be changed h. Exit MySQL by typing: i. The system setup is complete and the users can begin the execution of the protocol.

Full installation
To perform a full installation, the following commands must be typed in Ubuntu terminal: 3. Install system updates, git, unzip and gunzip:

Install ACT scripts
Timing: 1 min 4. Download the necessary codes for this protocol from GitHub: Note: ACT folder is created automatically through git and all custom programs, scripts and files described in this protocol are included or produced inside.

Install R & RStudio
Timing: 15 min 5. R and optionally RStudio, need to be installed on the Ubuntu machine.
a. Install latest version of R and necessary dependencies: b. Install RStudio: i. Download the latest version of RStudio for Ubuntu: https://www.rstudio.com/products/ rstudio/download ii. Install the .deb file using:

Install MySQL and PHP
Timing: 10 min 6. MySQL database management system (DBMS) needs to be installed on the Ubuntu machine, to store and access data, using relevant SQL queries. a. Setup MySQL client and server: i. Access MySQL as root after installation using the following command in Ubuntu terminal: Note: Press ''Enter'' when asked for MySQL root password, unless a root password is already set up.
The following commands must be typed in MySQL environment: ii. Set up MySQL root password using: iii. Enable local file loading using: iv. Create a MySQL user account using: v. Create the database using: Note: In our example, the <database name> is ''Athaliana''.
7. PHP is used in this protocol to run all scripts necessary for data file parsing, format conversion and database access. a. Install PHP and MySQL module for PHP: b. Get data from MySQL using PHP: ACT/Parsers/config.ini includes the local MySQL credentials and should be changed accordingly.
Install MySQL workbench and create the ERD and the tables of the database

Timing: 30 min
MySQL Workbench is a visual database design tool to create the Entity-Relation Diagram (ERD) and the tables of the database.
8. Install MySQL Workbench: a. Install necessary packages: b. Download MySQL Workbench for Ubuntu 20.04: https://dev.mysql.com/downloads/ workbench/ c. Install the .deb file using: 9. The local MySQL database ERD is designed using MySQL Workbench and consists of 8 tables ( Figure 1). ''Expression'' will be used to save the expression values of each probeset in each sample. ''Probeset'' contains the association between probesets and genes. ''Probesets'' contains the probesets as well as a unique numeric ID associated with each probeset. ''ENSG'' contains the AGI code, gene symbol, gene name and brief descriptions of Arabidopsis thaliana genes. ''Selec-ted_Genes'' contains the AGI codes of genes that will be studied. ''Sample'' contains details of all samples. ''Samples'' contains the same details as ''Sample'', except there is a unique numeric ID associated with each sample. ''Selected_Samples'' contains the unique numeric ID of the representative samples after quality control is over. Athaliana.mwb is the MySQL workbench file of the proposed ERD in this protocol. a. Open a new Ubuntu terminal instance and access local MySQL database from using: CRITICAL: All commands requiring MySQL will be performed from this distinct terminal instance.
b. Create the tables by copying and pasting the SQL

Sample search and collection
Timing: 2 days The main aim of this protocol is to perform a condition-independent coexpression analysis for the species of interest. As such, the microarray samples need to be of various distinct healthy tissues and originate from the same Affymetrix microarray chip. Note: Try selecting an Affymetrix microarray chip that not only has a large number of openaccess samples in public repositories but also includes a significant proportion of the genes of the selected organism. For example, ATH1-121501 Genome Array chip studies $24,000 genes, while AG Affymetrix Arabidopsis Genome Array studies only $8000 genes.
2. Search public repositories for 'Arabidopsis thaliana' keyword and the platform code of the microarray chip. Each repository uses a different platform code for each chip. Furthermore, alternative platform codes of the same chip and repository must also be examined. a. Search ArrayExpress (Kolesnikov et al., 2015) for 'Arabidopsis thaliana' and 'A-AFFY-2' (the platform code of ATH1-121501 in ArrayExpress) keywords. b. Search Gene Expression Omnibus (GEO) for 'Arabidopsis thaliana' and 'GPL198' (the platform code of ATH1-121501 in GEO) keywords. c. Download all experiments from Nottingham Arabidopsis Stock Centre (NASC) (Craigon et al., 2004) repository using the following link: https://uniofnottm-my.sharepoint.com/: f:/g/personal/sean_may_nottingham_ac_uk/Ep5b_GCihv1Nu0EYxWpkZggBK-6kAgZjMfk-9JQWJPyXUg?e=nXhIrh CRITICAL: Arrange the series in different directories. Each series directory should contain the raw data (CEL files) of the samples of the series. <Series Main Directory> is the path of the directory that contains the directories of all series.
Pause point: Depending on the total number, study download might take a considerable amount of time.
a. Unzip any zipped and/or gzipped CEL files, delete folders of studies which do not contain any CEL files and convert binary CEL files to text files using the following command in Ubuntu bash: b. Check if platform is ''ATH1-121501'' in each CEL file and delete those CEL files which are of different platforms, using: c. Check and auto-delete duplicate CEL files, using: Checkpoint: When our analysis was performed (13 th June 2018), 19887 unique CEL files belonging to 1391 studies, were stored.

Sample normalization
Timing: 1d to 2 months (Varies depending on the sample number) The downloaded samples are normalized with Single Channel Array Normalization (SCAN). As the default Affymetrix CDF is outdated, the latest version of Brainarray (Dai et al., 2005) CDF is used. This guarantees that each probe set corresponds to a single gene and vice versa. Note: In our computer setup, SCAN needs about 2-3 min to run for each sample of Affymetrix ATH1-121501 chip. As such, SCAN total execution time depends on the total number of samples. A matrix containing all $20,000 genes of the CDF and their expression values in each sample of a series is produced as SCAN_matrix.txt inside each series directory.
Pause point: Depending on the total number of samples, SCAN might be running for days, weeks or more than a month. The users should leave the computer running during this time, thus, the use of Uninterruptible Power Supply (UPS) is highly advisable.
a. Navigate to ACT directory in Ubuntu terminal: Record the path of the ACT directory on your system. Replace it for <ACT path> in the following MySQL queries. b. Parse all SCAN_matrix.txt files to a single txt file (expression.txt) in a 4-column format, using: Checkpoint: expression.txt is a tab-delimited file of n•m lines and 4 columns, where n is the number of genes and m in the number of downloaded samples. c. Produce the .txt files for the tables of the database using: d. Insert the produced .txt files into local MySQL database in MySQL terminal using, the following commands in this order: Checkpoint: selected_genes.txt is a single column file of n lines, where n is the number of genes which are described by the BrainArray CDF.

Sample quality control
Timing: 1-2 weeks A series of quality controls with different metrics needs to be performed to guarantee high-quality samples. In addition, samples which come from whole plant experiments or are from infected or mutated samples need to be deleted from their directory. However, in most of the cases there is no way to parse programmatically the metadata for each sample, as their format may vary significantly, which can lead to loss of important sample details. Thus, extensive manual curation is required.
6. Acquire quality control metrics for each series in RStudio, using:  (Bolstad et al., 2005) for each sample of the study, are created. b. Samples whose RLE boxplot has an interquartile range (IQR) >0.4 or median >|0.2| (deviates from 0), are considered low-quality. c. Samples whose NUSE boxplot has a median >1.1 (deviates from 1), are considered low-quality.  Note 2: Since the aim of this protocol is to study the condition independent coexpression landscape of a species, the same procedure can be used for other organisms, keeping only healthy samples of distinct tissues.
Pause point: The users can pause and continue the metadata examination step at any point.
9. After quality control is complete, only healthy single-tissue samples remain.  (Sokal and Michener, 1958) and a sample correlation tree is produced which is then automatically trimmed to contain the most representative leaf-samples.
12. PCCs between all pairs of selected samples are calculated with the following formula:   ll OPEN ACCESS r xy = P n i = 1 ðx i À xÞðy i À yÞ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi P n i = 1 ðx i À xÞ 2 • P n i = 1 ðy i À yÞ   b. Calculate sample pairwise r-values and convert them to a distance value with d = 1 -r formula (Kassambara, 2017) in RStudio, using: 14. Create the sample correlation tree with UPGMA hierarchical clustering algorithm, using the following commands in RStudio: ''Zat10-OE-1'', ''Zat10-OE-2'' and ''Zat10-OE-3'' samples should be deleted, as they come from mutated plants, while ''wildtype-1'', ''wildtype-2'' and ''wildtype-3'' samples should be accepted, as they are based on wild-type plants. 15. Sort the produced sample correlation tree using Newick Utilities (Junier and Zdobnov, 2010) in Ubuntu terminal: 16. Prune the tree to a desired number of representative leaves (in our case the <leaf number to remain> is 3500 samples), using an in-house iterative phylogenetic algorithm pruning adjacent leaves: Those samples constitute the representative samples for the coexpression analysis.
Checkpoint: Representative_samples.new is a Newick formatted file that contains as many leaves as the number of remaining samples (in our case 3500).
17. Obtain the leaf-sample names using: and produce Selected Samples indexes using: 18. Empty Selected_Samples Gene coexpression tree creation

Timing: 2 days
By calculating the pairwise Pearson Correlation coefficients between all gene pairs from the representative samples, we can create a gene coexpression distance matrix which will be used as input for the construction of the gene coexpression tree.
19. PCC-based distances between all pairs of selected genes are calculated with the previously mentioned formula (Equation 1). However, in this case:  e. Paste the AGI code on the search field and press Enter. The leaf that corresponds to the gene of interest will be highlighted in yellow f. Using the mouse wheel, zoom to a subtree node containing the highlighted gene of interest g. Left-click the last common ancestral node h. Select Select > Advanced Selection > Select Subnetwork i. Select Select > Invert Selection j. Select Edit > Delete Taxa k. Export the subtree as Newick by clicking File > Export > Newick and saving it as a .new file l. The gene list can be extracted from the subtree file using: Checkpoint: subtree_gene_list.txt is a single column file of s lines, where s is the number of leaves of the subtree Newick-formatted file exported by Dendroscope.

EXPECTED OUTCOMES
The outcome of the protocol is a gene coexpression phylogenetic tree ( Figure 4A) as a Newickformatted file. In this example, the output tree contains 20,430 Arabidopsis thaliana genes which are represented as leaves. Coexpressed genes that may constitute functional partners and, thus, share similar biological functions and metabolic pathways, are grouped together in the same subclade. The tree itself can be viewed by various phylogenetic software supporting Newick-formatted trees.
The list of the genes of a subtree can be used as input in downstream analyses, such as functional network analysis or enrichment analysis. By examining the enriched gene terms of that subtree, new biological functions for those gene sets may be discovered. In addition, it is possible to attribute biological roles to neighboring genes of unknown function.
Functional partners of a gene of interest may be found in its neighboring leaves. In the example coexpression subtree ( Figure 4B), CTL2 (shown by its AGI code: AT3G16920), is grouped with genes which are associated with plant-type secondary wall biogenesis.

QUANTIFICATION AND STATISTICAL ANALYSIS
Enrichment analysis can discover the predominant biological processes of a given coexpressed gene list: 1. String (Szklarczyk et al., 2021) creates a protein-protein interaction (PPI) network using a gene list as input: a. Visit the String website (https://string-db.org/) and click on ''Search'' b. Click on ''Multiple Proteins'' c. Paste the contents of subtree_gene_list.txt file that was created previously to ''List Of Names'' field and select the Organism (in this case Arabidopsis thaliana) d. Click on ''Search'' e. Click on ''Continue'' f. The resulting PPI network of the input genes is displayed g. In ''Analysis'' tab, ''Network Stats'' show the density of the network and consequently indicate the degree of functional interaction between the input genes. ''Functional enrichments in your network'' show the predominant biological terms which are associated with the coexpressed genes. h. In ''Settings'' tab, network parameters, such as interaction sources and interaction thresholds, can be adjusted.

LIMITATIONS
The main limitation of this protocol originates from the transcriptomic technology of microarrays which is not able to study the expression of genes for which no probe is available. Furthermore, cross-hybridization may make false estimations of the gene expression, distorting correlation between members of the same family of genes and other genes. Thus, RNA-seq, having greatly advanced in the latest years, has replaced microarrays as transcriptomic technology of choice, to a large extent. RNA-seq has higher sensitivity and there is a growing amount of data available in public repositories. Nevertheless, it is shown that microarray and RNA-seq-based coexpression analyses produce comparable gene coexpression networks (Malatras et al., 2020;Obayashi et al., 2018). Considering the fact that there is no definitive pipeline to perform coexpression analysis from raw RNA-seq data, microarray data are still relevant, as their normalization algorithms have been tested and perfected throughout the technology's lifetime.
A limitation of gene coexpression tree depiction, used in this protocol, is the fact that it cannot portray anti-coexpressed genes. As gene pairwise r-values are transformed to non-negative distance values, anti-correlated genes are not inferred. Finally, this depiction assumes that one gene may only participate in a single group of functional partners. This limitation contradicts the already known fact that genes may interact with different gene subgroups which are related to different functions.
As far as the execution of this protocol is concerned, advanced programming and database management knowledge is required.

TROUBLESHOOTING
Problem 1 RStudio crashes/displays errors during the creation of the coexpression tree of genes (step 21).

Potential solution
First, make sure that the matrix is formatted correctly for Phangorn and that there are no missing values in the matrix.
When trying to calculate the correlations or produce a tree using a large number of genes (>20,000), R requires a lot of RAM (possibly more than the recommended 64 GB). We recommend closing all other applications that might use memory resources. If the problem persists, the only solution would be to increase the available RAM of the machine.

Problem 2
Error during the installation of Bioconductor packages.

Potential solution
This protocol assumes that the latest available R version is used. However, at some point in time, certain packages may stop being supported. In such case, we recommend installing a (older) version of R that supports the installation of those packages.

Problem 3
Oligo package does not support the creation of additional quality control metrics (apart from RLE and NUSE boxplots) for downloaded samples (step 6).

Potential solution
We propose installing the following packages to produce NUSE and RLE boxplots and Quality Control reports (saved as AffyQCReport.pdf). However, those are only available for non-exon arrays microarray platforms and for older versions of R (<4.0.3).

Problem 4
Newick Utilities cannot be installed/run on my system.

Potential solution
In this protocol, we suggest downloading an already compiled version of Newick Utilities, which is, however, available only through the Wayback Machine. If the link becomes dead, or any other problem occurs, we recommend visiting the official GitHub page of the software (https://github.com/tjunier/ newick_utils). Alternatively, other software, such as Dendroscope, can be used for tree sorting.

Problem 5
The coexpression tree cannot load in Dendroscope (step 23).

Potential solution
Phylogenetic trees with more than 30,000 leaves, require larger amounts of RAM to open in Dendroscope. We suggest increasing the available RAM of Dendroscope or using another tree visualization software.

Problem 6
There is an issue with one or more PHP scripts that are included in the ACT GitHub folder.

Potential solution
The user can report the issue through GitHub.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Ioannis Michalopoulos (imichalop@bioacademy.gr).

Materials availability
This study did not generate new unique reagents.

Data and code availability
The microarray samples analyzed during the current study are available at: https://data.mendeley. com/datasets/hgvk669v89/ Custom scripts and data used are available in: https://github.com/imichalop/ACT Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.